Summary

  1. The response variable looks approximately normally distributed, so I will use a linear regression model.
  2. New features:
  • weekdays - 2 levels: Workday; Weekend
  • SumWin - 4 levels: Winter2; Summer1; Summer2; Winter1
  • aftOpen - how many days after the open date
  • toClose - how many days to close date
  • HolBef - The day before is a holiday
  • HolAft - The day after is a holiday
  • tempLev - 2 levels: Hot(>40); Cold(<=40)
  • prev1Prof - the sum of profit on the previous 1 day
  • prev5Prof - the sum of profit in the previous 5 days
  • prev1Temp - the sum of temperature on the previous 1 day
  • prev5Temp - the sum of temperature in the previous 5 days
  • DofNoProfBef - How many consecutive days of no profit before the current day
    3 levels: 0; 1-4; >=5
  • prevDRain - Did it rain the day before
    3 levels: Unknow; No; Yes
  • DofRainfBef - How many consecutive days of rain before the current day
    3 levels: 0; 1-2; >=3
  3. Add the new features, the original features, and some interactions to our model. Use 10-fold cross-validation to select features and reduce overfitting.
  4. Randomly choose 6000 observations as the training set; the remaining data are the test set. Fit the model on the training data. The true sum of profit for the test data is 447765 and the predicted sum of profit is 443778.9.
  5. Use all the data we have to fit the model, then predict profit from January 1st through October 30th of 2022 for the 3 candidate locations:
location_id  PredictedProfit
         11            17716
         12            17294
         13            16061
  6. Future study:
  • Since the data are periodic, we can also try an AR model.

  • For a real data study, I may also consider financial variables as features. When opening a new location, we should also consider nearby competitors. And in the beginning, while customers are not yet familiar with the company, profit will be much lower than at established locations.

  • When I divide features into different levels, I can use clustering and other methods.

  • When I fill in missing values, I can fit a model (regression/classification) to predict those missing values.

  • If the same transaction_id means the same customer, we can add this information to our model to make a better prediction.

  • I can also add more features from transformation by other features in the model.

  • Calculate the significance of each feature.

  • Obviously, our estimation is biased, because we are using the Lasso estimator for the coefficients of the covariates. There are debiased estimators that could correct this.
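As a sketch of the AR idea, base R's arima() can fit an autoregressive model. The series below is simulated, standing in for one location's daily profit totals ordered by date:

```r
# Sketch: fit an AR(1) model and forecast ahead.
# `y` is synthetic; with the real data it would be one location's
# daily profit series, ordered by date.
set.seed(1)
y <- 100 + arima.sim(model = list(ar = 0.6), n = 300)

fit <- arima(y, order = c(1, 0, 0))   # AR(1) with an intercept
fc  <- predict(fit, n.ahead = 7)      # forecast the next 7 days
fc$pred                               # point forecasts
```

Since the closures create long gaps, a real application would need to fit the model per open period or handle the gaps explicitly.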

Read Data

library(tidyverse)  # %>% pipelines, dplyr, ggplot2, tibble
library(rjson)      # fromJSON(file = ...)
library(GGally)     # ggpairs
library(plotly)     # ggplotly

Hol <- read.csv("data/holiday_data.csv") %>% as_tibble()
Loc <- read.csv("data/location_data.csv") %>% as_tibble()
Wet <- read.csv("data/weather_data.csv") %>% as_tibble()
Tr1 <- fromJSON(file = "data/transactions_001_system3.json")
Tr2 <- fromJSON(file = "data/transactions_002_system3.json")
Tr3 <- read.table("data/transactions_003_system2.txt", header = T)
Tr4 <- read.table("data/transactions_004_system2.txt", header = T)
Tr5 <- fromJSON(file = "data/transactions_005_system3.json")
Tr6 <- read.table("data/transactions_006_system2.txt", header = T)
Tr7 <- read.table("data/transactions_007_system2.txt", header = T)
Tr8 <- read.csv("data/transactions_008_system1.csv")
Tr9 <- read.table("data/transactions_009_system2.txt", header = T)
Tr10 <- fromJSON(file = "data/transactions_010_system3.json")

Some Modifications

Data Type

Transform the data type for each column

transJSON <- function(table){
  Tab <- table %>%
  sapply(function(l)
    c(ifelse(is.null(l$location_id), NA, l$location_id), 
      ifelse(is.null(l$date), NA, l$date), 
      ifelse(is.null(l$transaction_id), NA, l$transaction_id), 
      ifelse(is.null(l$profit), NA, l$profit))) %>%
  data.frame(row.names = c("location_id", "date", 
                           "transaction_id", "profit")) %>% 
    t %>% as.data.frame() %>% 
    mutate(location_id = as.integer(location_id),
           date = as.Date(date),
           transaction_id = as.integer(transaction_id),
           profit = as.numeric(profit))
  return(Tab)
}
Tr1 <- Tr1 %>% transJSON %>% as_tibble()
Tr2 <- Tr2 %>% transJSON %>% as_tibble()
Tr5 <- Tr5 %>% transJSON %>% as_tibble()
Tr10 <- Tr10 %>% transJSON %>% as_tibble()
transtxt <- function(table){
  Tab <- table %>%
    mutate(location_id = as.numeric(location_id),
           date = as.Date(date, tryFormats = c("%m-%d-%Y")),
           transaction_id = as.numeric(transaction_id),
           profit = profit %>% strsplit("-") %>%
             sapply(function(l)l[length(l)]) %>% 
             as.numeric()) %>% as_tibble()
  return(Tab)
}
Tr3 <- Tr3 %>% transtxt()
Tr4 <- Tr4 %>% transtxt()
Tr6 <- Tr6 %>% transtxt()
Tr7 <- Tr7 %>% transtxt()
Tr9 <- Tr9 %>% transtxt()
Tr8 <- Tr8 %>% mutate(location_id = as.integer(location_id),
               date = as.Date(date, tryFormats = c("%m/%d/%Y")),
               transaction_id = as.integer(transaction_id),
               profit = profit %>% substring(2) %>% as.numeric()) %>% 
  as_tibble()

Missing Data

Deal with Missing Data in Tables transactions_001 - 010

for (i in 1:10) {
  print(get(paste0("Tr",i)) %>% is.na() %>% colSums())
}
##    location_id           date transaction_id         profit 
##              0              0              0              2 
##    location_id           date transaction_id         profit 
##              0              0              0              3 
##    location_id           date transaction_id         profit 
##              0              0              0              1 
##    location_id           date transaction_id         profit 
##              0              0              0              6 
##    location_id           date transaction_id         profit 
##              0              0              0              5 
##    location_id           date transaction_id         profit 
##              0              0              0              3 
##    location_id           date transaction_id         profit 
##              0              0              0              6 
##    location_id           date transaction_id         profit 
##              0              0              0              0 
##    location_id           date transaction_id         profit 
##              0              0              0              0 
##    location_id           date transaction_id         profit 
##              0              0              0              1

In most of these 10 tables, missing data exist in the final column, profit.

Before doing anything, I should clarify the reason for the missing data. If missing means the profit equals 0, I can simply remove those rows or fill them with 0. If a missing value means a missing record, deleting the row would lose information; in that case, for more accurate values, we can fill the missing values with the mean profit on the corresponding day. I'm not sure whether the same transaction id corresponds to the same customer. If it does, I could also use the mean profit for the same transaction id on the previous and next days, where those exist. More complex approaches are k-nearest neighbors or fitting a regression to 'predict' the missing values. We could even use the EM algorithm if those missing values are considered extremely important.
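The fill-by-daily-mean option can be sketched with base R's ave(). The toy table `d` mirrors the columns of the transaction tables:

```r
# Sketch: fill missing profit with the mean profit recorded on the same date.
# Toy data standing in for one of the transaction tables.
d <- data.frame(
  date   = as.Date(c("2020-02-07", "2020-02-07", "2020-02-07", "2020-02-08")),
  profit = c(10, NA, 14, 20)
)

# Per-date mean, ignoring NAs, broadcast back to every row
day_mean <- ave(d$profit, d$date, FUN = function(p) mean(p, na.rm = TRUE))
d$profit[is.na(d$profit)] <- day_mean[is.na(d$profit)]
d$profit
# the NA on 2020-02-07 becomes mean(10, 14) = 12
```

(As shown below, this option turns out not to apply here: the dates with missing profit have no other records at that location.)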

After running the following code,

for (i in 1:10) {
  print(get(paste0("Tr",i)) %>% 
          (function(t) 
            t[t$date %in% (t$date[t$profit %>% is.na()]),]))
}
## # A tibble: 2 × 4
##   location_id date       transaction_id profit
##         <int> <date>              <int>  <dbl>
## 1           1 2020-02-07              1     NA
## 2           1 2020-02-14              1     NA
## # A tibble: 3 × 4
##   location_id date       transaction_id profit
##         <int> <date>              <int>  <dbl>
## 1           2 2020-05-29              1     NA
## 2           2 2020-05-30              1     NA
## 3           2 2020-06-15              1     NA
## # A tibble: 1 × 4
##   location_id date       transaction_id profit
##         <dbl> <date>              <dbl>  <dbl>
## 1           3 2020-02-10              1     NA
## # A tibble: 6 × 4
##   location_id date       transaction_id profit
##         <dbl> <date>              <dbl>  <dbl>
## 1           4 2020-02-07              1     NA
## 2           4 2020-02-14              1     NA
## 3           4 2020-05-28              1     NA
## 4           4 2020-05-30              1     NA
## 5           4 2021-02-05              1     NA
## 6           4 2021-02-17              1     NA
## # A tibble: 5 × 4
##   location_id date       transaction_id profit
##         <int> <date>              <int>  <dbl>
## 1           5 2019-02-26              1     NA
## 2           5 2020-02-17              1     NA
## 3           5 2020-05-28              1     NA
## 4           5 2020-06-01              1     NA
## 5           5 2021-02-03              1     NA
## # A tibble: 3 × 4
##   location_id date       transaction_id profit
##         <dbl> <date>              <dbl>  <dbl>
## 1           6 2020-02-05              1     NA
## 2           6 2020-05-28              1     NA
## 3           6 2021-02-10              1     NA
## # A tibble: 6 × 4
##   location_id date       transaction_id profit
##         <dbl> <date>              <dbl>  <dbl>
## 1           7 2020-02-17              1     NA
## 2           7 2020-05-28              1     NA
## 3           7 2020-05-29              1     NA
## 4           7 2021-02-15              1     NA
## 5           7 2021-02-17              1     NA
## 6           7 2021-02-18              1     NA
## # A tibble: 0 × 4
## # … with 4 variables: location_id <int>, date <date>, transaction_id <int>,
## #   profit <dbl>
## # A tibble: 0 × 4
## # … with 4 variables: location_id <dbl>, date <date>, transaction_id <dbl>,
## #   profit <dbl>
## # A tibble: 1 × 4
##   location_id date       transaction_id profit
##         <int> <date>              <int>  <dbl>
## 1          10 2021-02-02              1     NA

I found that, for every missing value, there are no other records on the same day at that location. Thus, I assume the missing values are erroneous records and simply delete those rows.

for (i in 1:10) {
  assign(paste0("Tr",i),
         get(paste0("Tr",i)) %>% na.omit())
}

Other things

# Sum profits from diff. transactions in the same day.
trans1day <- function(table){
  table <- table %>% group_by(date) %>% 
  summarise(location_id = unique(location_id),
            profit = sum(profit),
            n=n())
  return(table)
}
for (i in 1:10) {
  assign(paste0("Tr", i), get(paste0("Tr", i)) %>% trans1day %>% 
           subset(select=c(2,1,3,4)))
}

# Combine data from different locations into the same table
Trs <- rbind(Tr1, Tr2, Tr3, Tr4, Tr5,
             Tr6, Tr7, Tr8, Tr9, Tr10)

# Combine the location and weather information with the `Trs` table
# Trs <- left_join(Trs, Loc, by = "location_id")
Trs <- expand.grid(location_id=1:13,
            date = seq(from = as.Date("2019-01-01"),
                       to = as.Date("2022-10-30"),
                       by = "day")) %>% 
  as_tibble() %>% 
  left_join(Trs, by = c("location_id", "date")) %>% 
  left_join(Loc, by = "location_id")

# Add a column to describe the year of the date
Trs$year <- as.numeric(format(Trs$date,"%Y"))
# Add a column to describe the Month and Day
Trs$md <- as.character(format(Trs$date,"%m-%d"))
# Add a column to describe the Weekday
Trs$weekdays <- 
  factor(format(Trs$date,"%a"), 
         levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))

# Change the Date Type of "date" in Holiday table and Weather table
Hol$date <- Hol$date %>% as.Date()
Wet$date <- Wet$date %>% as.Date()

# Make sure there is only one weather record per day at each location
Wet %>% group_by(location_id, date) %>% 
  summarise(n=n()) %>% `$`(n) %>% `==`(1) %>% all()
## `summarise()` has grouped output by 'location_id'. You can override using the
## `.groups` argument.
## [1] TRUE

Exploratory Analysis

Trs table

Response: profit

Check the distribution of the response variable:

Trs$profit %>% hist

The response variable looks approximately normally distributed. Thus, I will use a linear regression model instead of a generalized linear model.
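A histogram is a coarse check; a normal Q-Q plot (and, if desired, a Shapiro-Wilk test) gives a sharper look. A minimal sketch, using simulated values in place of Trs$profit so it is self-contained:

```r
# Sketch: Q-Q plot as a sharper normality check than a histogram.
# `x` stands in for Trs$profit (simulated here so the snippet runs on its own).
set.seed(2)
x <- rnorm(500, mean = 200, sd = 30)

qqnorm(x); qqline(x)            # points should hug the line if roughly normal
p <- shapiro.test(x)$p.value    # formal test; large p = no evidence against normality
```

With the real data, heavy tails or skew would show up as curvature at the ends of the Q-Q plot.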

Plot the date vs. profit:

Trs %>% filter(location_id<=10) %>% 
  ggplot(aes(x = date, y = profit))+
  geom_point()+
  facet_grid(location_id~.)

It looks like the rental locations close for a period each year; that is why there are no records. I will treat those days as missing values instead of assigning them profit 0.

Also, I checked whether all locations closed on the same days:

Trs %>% filter(location_id<=10) %>% 
  group_by(date) %>% 
  summarise(n=sum(!is.na(profit))) %>% `$`(n) %>% table
## .
##   0   1   2   3   4   5   6   7   8   9  10 
## 572  38  15  13   4   4   2   5  11  13 722

The answer is no. So I can't assume the new locations will shut down on the same days. Therefore, I will also predict profit for the candidate locations on days when they might be closed.

Weekdays

The marginal relationship between weekdays and profit

Trs %>% ggplot(aes(x=profit)) +
  geom_histogram(bins = 40)+
  facet_grid(rows = as.factor(Trs$weekdays),
             scales = "free_y")

The distribution of profit is similar across the workdays, and the distributions on Saturday and Sunday resemble each other, but workdays differ from weekends. So I divide the weekdays into two levels: Workday and Weekend.

Trs <- Trs %>% mutate(weekdays = ifelse(weekdays %in% c("Sat", "Sun"), "Weekend", "Workday"))

Summer and Winter

(Trs %>% filter(location_id<=10) %>% 
  ggplot(aes(x = as.Date(md, format = "%m-%d"),
                   y = profit,
                   color = as.factor(location_id))) +
  geom_point(alpha=0.5)+
  geom_vline(xintercept = as.Date(c("2022-05-01", "2022-10-01")))+
  labs(x = "Date")+
  scale_x_continuous(breaks = as.Date(c("2022-01-01", "2022-04-01", 
                                "2022-07-01", "2022-10-01",
                                "2023-01-01")),
                    labels = c("01-01", "04-01", 
                                "07-01", "10-01",
                                "01-01"))) %>% 
  ggplotly

I will add a feature to distinguish summer from winter, since the profit patterns in summer and winter are different.

Trs$SumWin <- Trs$date %>% format("%m") %>% 
  as.numeric() %>% (function(n) ifelse(n<5, 1, 
                                ifelse(n<7, 2, 
                                ifelse(n<10, 3, 4)))) %>% 
  factor(labels = c("Winter2", "Summer1", "Summer2", "Winter1"))

Profits clearly increase when the rental locations reopen after a long closure and decrease for a period before the next closure. I can add features to capture this: how many days since the last opening and how many days until the next closing. Since I don't know the closure schedule for the new rental locations, I will use the earliest observed open date and the latest observed close date as the open and close days.

Trs %>% filter(!is.na(Trs$profit),
               as.numeric(format(date, "%m"))<5) %>% 
  `$`(date) %>% max()
## [1] "2022-03-18"
Trs %>% filter(!is.na(Trs$profit),
               as.numeric(format(date, "%m"))>=5) %>% 
  `$`(date) %>% min()
## [1] "2019-05-26"
Trs %>% filter(!is.na(Trs$profit),
               as.numeric(format(date, "%m"))<10) %>% 
  `$`(date) %>% max()
## [1] "2022-09-11"
Trs %>% filter(!is.na(Trs$profit),
               as.numeric(format(date, "%m"))>=10) %>% 
  `$`(date) %>% min()
## [1] "2019-11-29"
Trs$aftOpen <- 0
Trs$toClose <- 0

Trs$toClose[Trs$SumWin=="Winter2"] <- 
  Trs %>% filter(SumWin=="Winter2") %>% 
  mutate(close = as.Date(paste0(year, "-03-18")),
         toClose = ifelse(close-date<0,0,close-date)) %>% 
  `$`(toClose)

Trs$toClose[Trs$SumWin=="Summer2"] <- 
  Trs %>% filter(SumWin=="Summer2") %>% 
  mutate(close = as.Date(paste0(year, "-09-11")),
         toClose = ifelse(close-date<0,0,close-date)) %>% 
  `$`(toClose)

Trs$aftOpen[Trs$SumWin=="Winter1"] <- 
  Trs %>% filter(SumWin=="Winter1") %>% 
  mutate(open = as.Date(paste0(year, "-11-29")),
         aftOpen = ifelse(date-open<0,0,date-open)) %>% 
  `$`(aftOpen)

Trs$aftOpen[Trs$SumWin=="Summer1"] <- 
  Trs %>% filter(SumWin=="Summer1") %>% 
  mutate(open = as.Date(paste0(year, "-05-26")),
         aftOpen = ifelse(date-open<0,0,date-open)) %>% 
  `$`(aftOpen)

Holiday Table

I'm considering the impact of holidays on profit. The following plots check whether the days before and after a holiday differ significantly from other days.

Trs %>% ggplot(aes(profit)) +
  geom_histogram(bins = 40) +
  facet_grid(rows = (Trs$date-1) %in% Hol$date, scales = "free_y")

Trs %>% ggplot(aes(profit)) +
  geom_histogram(bins = 40) +
  facet_grid(rows = (Trs$date+1) %in% Hol$date, scales = "free_y")

From these two histograms, I believe there is a significant difference, so I add two new features to capture this information.

Trs$HolBef <- (Trs$date-1) %in% Hol$date
Trs$HolAft <- (Trs$date+1) %in% Hol$date

Weather table

According to the introduction of this case study, a small amount of weather data is missing. I want to fill in that missing weather data first.

Wet <- expand.grid(location_id=1:13,
            date = seq(from = as.Date("2019-01-01"),
                       to = as.Date("2022-10-30"),
                       by = "day")) %>% 
  left_join(Wet, by = c("location_id", "date")) %>% 
  as_tibble()
# Wet <- Wet %>% left_join(Loc, by = c("location_id"))

Temperature

Plot date vs. temperature:

Wet %>% ggplot(aes(x = date, y = temperature)) + geom_point() + facet_grid(rows = Wet$location_id)

This is periodic data. We could build a general periodic model or an autoregressive model and use it to predict the missing temperatures. But I will use a simpler approach: impute each missing value with the mean temperature over the surrounding days (up to two days on each side) at the same location.

for (i in which(is.na(Wet$temperature))) {
  Wet$temperature[i] <- 
    mean(Wet$temperature[Wet$date>(Wet$date[i]-3) & 
                           Wet$date<(Wet$date[i]+3) & 
                           Wet$location_id==Wet$location_id[i]], na.rm = T)
}

Plot temperature vs. profit

Trs %>% left_join(Wet, by = c("location_id", "date")) %>% 
  filter(location_id<=10) %>% 
  ggplot(aes(temperature, profit))+
  geom_point()+geom_vline(aes(xintercept=40))+
  facet_wrap(~location_id)

Clearly, there are two groups, and the trends in the two groups differ. There are no records around temperature = 40, so I will add a feature that splits temperature into two levels.

Wet$tempLev <- ifelse(Wet$temperature>40, "Hot", "Cold") %>% factor()

Pressure

Plot date vs. pressure:

Wet %>% ggplot(aes(x = date, y = pressure, color = temperature)) + 
  geom_point() + facet_grid(rows = Wet$location_id)

There is no obvious pattern in these plots, so I will impute the missing values with the mean of all pressure readings.

Wet <- Wet %>% mutate(pressure = ifelse(is.na(pressure), mean(Wet$pressure, na.rm = T), pressure))

Plot pressure vs. profit

Trs %>% left_join(Wet, by = c("location_id", "date")) %>% 
  filter(location_id<=10) %>% 
  ggplot(aes(pressure, profit))+geom_point()+
  facet_wrap(~location_id)

From these plots, I don't think pressure is a significant feature for predicting profit, but I will still include it in the model.

Humidity

Draw a histogram of the humidity data. The distribution is a bit strange: the lower the humidity, the fewer the days, except at 0.

Wet[order(Wet$date),] %>% filter(location_id == 1) %>% `$`(humidity) %>% hist(breaks=50)

I prefer to treat those zeros as missing values (I am not sure this is correct; it needs to be checked against the data source).

Wet <- Wet %>% mutate(humidity = ifelse(humidity==0, NA, humidity))

Plot date vs. humidity:

Wet %>% ggplot(aes(x = date, y = humidity)) + 
  geom_point() + facet_grid(rows = Wet$location_id)

The distribution of humidity looks more complex. To predict the missing data more accurately, I could build a model such as:

\[
\begin{aligned}
\text{Humidity} &\sim \text{Unif}(a, 1)\\
a &= \beta_0 + \beta_1\,\text{Temperature} + \epsilon\\
\epsilon &\sim N(\mu, \sigma)
\end{aligned}
\]

I won't do that this time, because it's not the goal of this case study. I use the same simple nearest-neighbor imputation again.

for (i in which(is.na(Wet$humidity))) {
  Wet$humidity[i] <- 
    mean(Wet$humidity[Wet$date>(Wet$date[i]-3) & 
                        Wet$date<(Wet$date[i]+3) & 
                        Wet$location_id==Wet$location_id[i]], na.rm = T)
}

Plot humidity vs. profit

Trs %>% left_join(Wet, by = c("location_id", "date")) %>% 
  filter(location_id<=10) %>% 
  ggplot(aes(humidity, profit))+geom_point()+
  facet_wrap(~location_id)

Same conclusion with pressure.

Cloudy & Precipitation

Since these two variables are categorical, I could build a logistic regression model or use a machine-learning classifier, for example random forest or XGBoost, to impute them. But that is a big topic, and the information we would use is inaccurate. So, I will instead add a new level, 'Unknow', to these two variables; in the model, this level will be treated as the mean.

Wet <- Wet %>% mutate(cloudy=ifelse(is.na(cloudy), 
                                    "Unknow", cloudy),
         precipitation=ifelse(is.na(precipitation), 
                              "Unknow", precipitation)) %>% 
  mutate(cloudy=factor(cloudy, levels = c("Unknow", "FALSE", "TRUE"), 
                             labels = c("Unknow", "No", "Yes")),
               precipitation=factor(precipitation, 
                                    levels = c("Unknow", "FALSE", "TRUE"), 
                                    labels = c("Unknow", "No", "Yes")))
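If the categorical weather flags were worth imputing properly, the logistic-regression idea above could look like this sketch. The data frame `w` and its columns are toy stand-ins; a real version would regress `precipitation` on `Wet`'s temperature, humidity, and pressure.

```r
# Sketch: impute missing rain flags with a logistic regression
# instead of an 'Unknow' level. Toy data.
set.seed(3)
w <- data.frame(humidity = runif(200))
w$rain <- rbinom(200, 1, plogis(4 * w$humidity - 2))   # wetter -> more rain
w$rain[sample(200, 20)] <- NA                          # knock out some values

fit  <- glm(rain ~ humidity, data = w, family = binomial)  # NA rows dropped
miss <- is.na(w$rain)
w$rain[miss] <- as.integer(
  predict(fit, newdata = w[miss, , drop = FALSE], type = "response") > 0.5)
```

The 0.5 cutoff is a convenience; predicted probabilities could also be kept as-is and fed to the model directly.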

cloudy vs profit

Trs %>% left_join(Wet, by = c("location_id", "date")) %>% 
  ggplot(aes(profit))+geom_histogram(bins = 40)+
  facet_wrap(~cloudy, scales = "free_y")

precipitation vs profit

Trs %>% left_join(Wet, by = c("location_id", "date")) %>% 
  ggplot(aes(profit))+geom_histogram(bins = 40)+
  facet_wrap(~precipitation, scales = "free_y")

Correlation

Trs %>% subset(select = c(location_id, date, profit)) %>% 
  left_join(Wet, by = c("location_id", "date")) %>% 
  subset(select=-c(location_id, date, tempLev)) %>% 
  ggpairs(ggplot2::aes(colour=Wet$tempLev, alpha=0.5))

Add more features

Combine all tables:

FullD <- Trs %>% left_join(Wet, by = c("location_id", "date"))

Add more features:
prev1Prof - the sum of profit on the previous 1 day
prev5Prof - the sum of profit in the previous 5 days
prev1Temp - the sum of temperature on the previous 1 day
prev5Temp - the sum of temperature in the previous 5 days
DofNoProfBef - How many consecutive days of no profit before the current day
  3 levels: 0; 1-4; >=5
prevDRain - Did it rain the day before
  3 levels: Unknow; No; Yes
DofRainfBef - How many consecutive days of rain before the current day
  3 levels: 0; 1-2; >=3

FullD$prev1Prof <- NA
FullD$prev5Prof <- NA
FullD$prev1Temp <- NA
FullD$prev5Temp <- NA
FullD$DofNoProfBef <- 0
FullD$prevDRain <- c(as.factor("Unknow"),FullD$precipitation[-nrow(FullD)])
FullD$DofRainfBef <- 0

FullD$DofRainfBef[FullD$prevDRain=="Yes"] = 1
for (i in 1:nrow(FullD)) {
  FullD$prev1Prof[i] <- sum(FullD$profit[FullD$location_id==FullD$location_id[i] & 
                                           FullD$date==(FullD$date[i]-1)], na.rm = T)
  FullD$prev5Prof[i] <- sum(FullD$profit[FullD$location_id==FullD$location_id[i] & 
                                           FullD$date>(FullD$date[i]-6) &
                                           FullD$date<FullD$date[i]], na.rm = T)
  
  FullD$prev1Temp[i] <- sum(Wet$temperature[Wet$location_id == FullD$location_id[i] & 
                                              Wet$date==(FullD$date[i]-1)])
  FullD$prev5Temp[i] <- sum(Wet$temperature[Wet$location_id==FullD$location_id[i] & 
                                            Wet$date>(FullD$date[i]-6) &
                                            Wet$date<FullD$date[i]])
  
  if((Wet$precipitation[Wet$location_id==FullD$location_id[i] & 
                        Wet$date>(FullD$date[i]-4) &
                        Wet$date<FullD$date[i]] == "Yes") %>% 
     sum() %>% `==`(3)){
    FullD$DofRainfBef[i] <- 3
  }
}
FullD$DofRainfBef <- factor(FullD$DofRainfBef, levels = c(0,1,3), 
                            labels = c("0", "1-2", ">=3"))

for (i in which(!FullD$date %in% (FullD$date+1))) {
  FullD$DofNoProfBef[i] = 1
  if((FullD$location_id==FullD$location_id[i] &
    FullD$date>(FullD$date[i]-6) &
    FullD$date<FullD$date[i]) %>% sum %>% `==`(0)){
    FullD$DofNoProfBef[i] = 5
  }
}

FullD$DofNoProfBef <- factor(FullD$DofNoProfBef, levels = c(0,1,5), 
                             labels = c("0", "1-4", ">=5"))

FullD <- FullD %>% filter(date>(as.Date("2019-01-01")+4))

Fit Model and Prediction

As discussed before, I will use a linear regression model, with cross-validation for feature selection. There are other approaches, for example stepwise selection.
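The stepwise alternative can be sketched with base R's step(). The data here are simulated, and the predictors x1-x3 are hypothetical stand-ins for the engineered features:

```r
# Sketch: backward stepwise (AIC) selection as an alternative to Lasso CV.
# Toy data: y depends on x1 only; step() should drop the noise predictors.
set.seed(4)
n  <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- 2 * df$x1 + rnorm(n)

full <- lm(y ~ x1 + x2 + x3, data = df)
sel  <- step(full, direction = "backward", trace = 0)
names(coef(sel))   # the true predictor x1 should survive
```

Unlike the Lasso, step() gives unpenalized (unbiased) coefficients for the selected features, at the cost of a greedier search.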

First, randomly choose a training set and fit the model on it. Then predict profit on the test data and compare with the true profit.

library(glmnet)

FullD_all <- FullD %>% filter(!is.na(profit)) %>% 
               subset(select=-c(location_id, date, n, year, md))

Cho <- sort(sample(1:nrow(FullD_all), 6000))
FullD_train <- FullD_all[Cho,]
FullD_test <- FullD_all[-Cho,]
  
FullDM_all <- model.matrix(profit~population+elevation+weekdays+
                         HolBef+HolAft+
                         SumWin+temperature+pressure+humidity+
                           SumWin:toClose+SumWin:aftOpen+
                         SumWin:temperature+
                         SumWin:pressure+
                         SumWin:humidity+
                         cloudy+precipitation+
                         SumWin:cloudy+SumWin:precipitation+
                         tempLev+
                         prev1Prof+prev5Prof+
                         prev1Temp+prev5Temp+
                         DofNoProfBef+prevDRain+DofRainfBef-1,
             data = FullD_all)
FullDM_train <- FullDM_all[Cho,]
FullDM_test <- FullDM_all[-Cho,]

fit_cv <- cv.glmnet(x = FullDM_train,y = FullD_train$profit)
fit1 <- glmnet(x = FullDM_train, y = FullD_train$profit,
              family = "gaussian", lambda = fit_cv$lambda.min)

Plot true profit vs. predicted profit

plot(FullD_train$profit,
     predict(fit1, newx = FullDM_train),
     main = "Training data")
lines(x = c(-100, 1000), y = c(-100, 1000), col = "red")

plot(FullD_test$profit,
     predict(fit1, newx = FullDM_test),
     main = "Testing data")
lines(x = c(-100, 1000), y = c(-100, 1000), col = "red")

True total profit vs. predicted total profit for testing data

sum(FullD_test$profit)
## [1] 441206.6
sum(predict(fit1, newx = FullDM_test))
## [1] 440785.9

I also tried using the data from locations 1 to 9 as the training set, fitting the model on it, and then predicting profit at location 10 to compare with the true profit.

FullD_9 <- FullD %>% filter(!is.na(profit),
                            location_id < 10) %>% 
               subset(select=-c(location_id, date, n, year, md))

FullD_10 <- FullD %>% filter(!is.na(profit),
                            location_id == 10) %>% 
               subset(select=-c(location_id, date, n, year, md))
  
FullDM_all <- model.matrix(profit~population+elevation+weekdays+
                         HolBef+HolAft+
                         SumWin+temperature+pressure+humidity+
                           SumWin:toClose+SumWin:aftOpen+
                         SumWin:temperature+
                         SumWin:pressure+
                         SumWin:humidity+
                         cloudy+precipitation+
                         SumWin:cloudy+SumWin:precipitation+
                         tempLev+
                         prev1Prof+prev5Prof+
                         prev1Temp+prev5Temp+
                         DofNoProfBef+prevDRain+DofRainfBef-1,
             data = rbind(FullD_9,FullD_10))

FullDM_9 <- FullDM_all[1:nrow(FullD_9),]
FullDM_10 <- FullDM_all[-c(1:nrow(FullD_9)),]

fit_cv <- cv.glmnet(x = FullDM_9, y = FullD_9$profit)
fit2 <- glmnet(x = FullDM_9, y = FullD_9$profit,
              family = "gaussian", lambda = fit_cv$lambda.min)

Plot true profit vs. predicted profit

plot(FullD_9$profit,
     predict(fit2, newx = FullDM_9),
     main = "Training data")
lines(x = c(-100, 1000), y = c(-100, 1000), col = "red")

plot(FullD_10$profit,
     predict(fit2, newx = FullDM_10),
     main = "Testing data")
lines(x = c(-100, 1000), y = c(-100, 1000), col = "red")

True total profit vs. predicted total profit for testing data

sum(FullD_10$profit)
## [1] 217568.4
sum(predict(fit2, newx = FullDM_10))
## [1] 224459.2

Result

Use all data from locations 1 to 10 to fit the model. Then, predict profit in 3 candidate locations.

FullD_Know <- FullD %>% filter(!is.na(profit)) %>% 
               subset(select=-c(location_id, date, n, year, md))
FullD_Pred <- FullD %>% filter(location_id > 10, 
                               date >= as.Date("2022-01-01") & 
                                 date<=as.Date("2022-10-30")) %>% 
               subset(select=-c(location_id, date, n, year, md))

FullDM <- model.matrix(~population+elevation+weekdays+
                         HolBef+HolAft+
                         SumWin+temperature+pressure+humidity+
                         SumWin:temperature+
                         SumWin:pressure+
                         SumWin:humidity+
                         cloudy+precipitation+
                         SumWin:cloudy+SumWin:precipitation+
                         tempLev+
                         prev1Prof+prev5Prof+
                         prev1Temp+prev5Temp+
                         DofNoProfBef+prevDRain+DofRainfBef-1,
             data = rbind(FullD_Know, FullD_Pred))

FullDM_Know <- FullDM[1:nrow(FullD_Know),]
FullDM_Pred <- FullDM[-c(1:nrow(FullD_Know)),]

fit_cv <- cv.glmnet(x = FullDM_Know, y = FullD_Know$profit)
fit <- glmnet(x = FullDM_Know, y = FullD_Know$profit,
              family = "gaussian", lambda = fit_cv$lambda.min)

Pred <- FullD %>% filter(location_id > 10, 
                         date >= as.Date("2022-01-01") & 
                           date<=as.Date("2022-10-30")) %>% 
  subset(select=c(location_id, date)) %>% 
  data.frame(predict(fit, newx = FullDM_Pred)) %>% 
  `colnames<-`(c("location_id", "date", "predProf")) %>% 
  as_tibble() %>% 
  left_join(FullD %>% filter(location_id <= 10, 
                 date >= as.Date("2022-01-01") & 
                   date<=as.Date("2022-10-30")) %>% 
  group_by(date) %>% 
  summarise(closure = sum(!is.na(profit))/10), by = "date")

PredPlot <- Pred %>% filter(closure != 0) %>% 
  ggplot(aes(date, predProf, 
             color = as.factor(location_id))) +
  geom_point(alpha = (Pred %>% filter(closure != 0))$closure)

ggplotly(PredPlot)
Pred %>% group_by(location_id) %>% 
  summarise(predProf = sum(predProf*closure))
## # A tibble: 3 × 2
##   location_id predProf
##         <dbl>    <dbl>
## 1          11   17716.
## 2          12   17294.
## 3          13   16061.

From the predicted profits above, the candidate location with location_id = 11 has the largest predicted profit from January 1st through October 30th of 2022.